Extending Metadata Definitions by Automatically Extracting and Organizing Glossary Definitions

Authors

  • Eduard H. Hovy
  • Andrew Philpot
  • Judith L. Klavans
  • Ulrich Germann
  • Peter T. Davis
Abstract

Metadata descriptions of database contents are required to build and use systems that access and deliver data in response to user requests. When numerous heterogeneous databases are brought together in a single system, their various metadata formalizations must be homogenized and integrated in order to support the access planning and delivery system. This integration is a tedious process that requires human expertise and attention. In this paper we describe a method of speeding up the formalization and integration of new metadata. The method takes advantage of the fact that databases are often described in web pages containing natural language glossaries that define pertinent aspects of the data. Given a root URL, our method identifies likely glossaries, extracts and formalizes aspects of relevant concepts defined in them, and automatically integrates the new formalized metadata concepts into a large model of the domain and associated conceptualizations.

1. Background: The EDC System

The Digital Government Research Center (DGRC) has been building a data delivery system called the Energy Data Collection (EDC) project (Ambite et al. 02; Philpot et al. 02; Hovy et al. 01). The system delivers to users data about the prices and volumes of petroleum product sales, as collected and recorded by the Energy Information Administration (EIA; see http://www.eia.doe.gov), the Bureau of Labor Statistics (BLS), the Census Bureau, and the California Energy Commission (CEC). The system currently provides access to over 58,000 data tables provided by these agencies. These tables are stored in many formats, including Microsoft Access databases, flat files of numbers, HTML pages, PDF files, etc. To provide dynamically planned access to so many non-homogeneous databases, the system employs the query planner SIMS (Arens et al. 96), which decomposes a user's query into a plan of smaller, manageable subqueries (Ambite and Knoblock 00), expresses them in SQL, retrieves the appropriate data, and recombines it as required.

In order to determine how to decompose the query, and to determine where to locate each specific type of data, SIMS employs a formal model of the domain in which each type of data is represented by an entity (which we call a concept) appropriately taxonomized in relation to other similar concepts. The domain model contains approximately 500 concepts, arranged in some dozen small taxonomies, each subtaxonomy modeling one pertinent aspect of the data in the energy domain. For example, one taxonomy models locations (states, areas, districts, etc.), another models units of measure (gallons, barrels, etc.), and a third models frequency of measurement (monthly, biweekly, etc.). The leaf concepts of each subtaxonomy directly and fully express the semantics of some aspect of the data in one or more databases in the collection. The formal equivalence holding between the leaf concepts and the data ensures that the SIMS planner can confidently plan with the concepts and return the data they indicate as responsive to the user's query.
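To make this structure concrete, the following minimal Python sketch shows how a few subtaxonomies with database-bound leaf concepts might be represented. The concept names, agency table identifiers, and the plain-Python data structure are illustrative assumptions for this paper's energy examples; they are not the actual SIMS/EDC domain model representation.

    # Minimal sketch of a domain model: a few subtaxonomies whose leaf
    # concepts are bound to (hypothetical) database tables and columns.
    from dataclasses import dataclass, field

    @dataclass
    class Concept:
        name: str
        parent: "Concept | None" = None
        children: list = field(default_factory=list)
        # For leaf concepts: which (database, table, column) realizes this concept.
        bindings: list = field(default_factory=list)

        def add_child(self, name: str, **kw) -> "Concept":
            child = Concept(name, parent=self, **kw)
            self.children.append(child)
            return child

    # Three small subtaxonomies, mirroring the examples in the text.
    location = Concept("Location")
    state = location.add_child("State")
    state.add_child("California", bindings=[("CEC", "retail_prices", "state")])
    area = location.add_child("Area")
    # Two slightly different agency definitions become two sibling concepts.
    area.add_child("MidAtlantic-EIA", bindings=[("EIA", "prices", "padd1b")])
    area.add_child("MidAtlantic-Census", bindings=[("Census", "regions", "midatl")])

    unit = Concept("UnitOfMeasure")
    unit.add_child("Gallon", bindings=[("EIA", "sales_volume", "gallons")])
    unit.add_child("Barrel", bindings=[("CEC", "refinery_output", "barrels")])

    frequency = Concept("Frequency")
    frequency.add_child("Monthly", bindings=[("BLS", "cpi_fuel", "month")])

    def leaves(c: Concept):
        """Yield the database-bound leaf concepts under c."""
        if not c.children:
            yield c
        for child in c.children:
            yield from leaves(child)

    if __name__ == "__main__":
        for leaf in leaves(location):
            print(leaf.name, "->", leaf.bindings)

In such a representation, the formal equivalence described above corresponds to the bindings attached to leaf concepts: a planner that selects a leaf concept knows exactly which stored data it denotes.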
In addition to helping SIMS plan its queries, the domain model has another function. Using a browser interface provided by the EDC system, the user can explore the taxonomies of the domain model in order to become familiar with the specifics of the data collections. Should two databases have slightly different definitions for the Mid-Atlantic region, for example, there will be two corresponding concepts in the subtaxonomy modeling Areas, with associated textual descriptions and possibly hyperlinks to associated online text. Such details allow the user to ensure that he or she is thoroughly familiar with exactly the data being requested from the system.

The domain model is the minimal collection of concepts required to define the data. Such minimalism is generally appreciated by the system builder and domain experts, but can lead to problems. Since the domain model contains no higher-level generalization concepts, it has no overarching organizing structure to relate its subtaxonomies to each other. One consequence is that non-expert users, who generally do not know the domain terminology, can have difficulty finding what they need or orienting themselves in the domain model. Another consequence is that adding new concepts that represent hitherto unmodeled aspects of (new) databases is difficult, since there are no obvious attachment points in the subtaxonomies.

We have therefore embedded the domain model into a very large general-purpose taxonomy of some 110,000 concepts called the Omega ontology (Hovy et al. 03). Omega, a successor to the SENSUS ontology built at ISI (Knight and Luk 94), likewise contains Princeton's WordNet (Fellbaum 98), but also includes New Mexico State University's Mikrokosmos/OntoSem (Mahesh and Nirenburg 95) and a wholly new upper structure. Using a partially automated linking process described below, the domain model concepts have been indexed into Omega, allowing the casual user to provide almost any English word to the system's ontology browser in order to locate closely associated domain model concepts. Omega concepts are connected to the domain model with what we call Generally Associated With (GAW) links. In contrast to the links from the domain model to the databases, which are formal and exact, the GAW links are less exact, more in the nature of free association, in order to connect the user's possibly ambiguous or slightly misphrased term choices to the right places.
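The following sketch illustrates the intended use of such links: a user's (possibly vague) English word is matched against lexical anchors of general ontology concepts and then followed across GAW links to candidate domain model concepts. The word lists, concept identifiers, and dictionary-based representation below are invented for illustration; Omega itself is far larger and is not reproduced here.

    # Illustrative sketch: route a user's English term through general
    # ontology concepts (stand-ins for Omega nodes) to domain model
    # concepts via GAW links. All data below is invented for illustration.

    # General ontology concepts with a few lexical anchors each.
    OMEGA_LEXICON = {
        "omega:fuel": {"fuel", "gasoline", "gas", "petrol"},
        "omega:region": {"region", "area", "territory"},
        "omega:price": {"price", "cost"},
    }

    # Generally-Associated-With links: loose associations from general
    # concepts to domain model concepts (not formal equivalences).
    GAW_LINKS = {
        "omega:fuel": ["dm:PetroleumProduct"],
        "omega:region": ["dm:Area", "dm:District", "dm:State"],
        "omega:price": ["dm:SalePrice"],
    }

    def domain_concepts_for(term: str) -> list:
        """Return domain model concepts loosely associated with a user's term."""
        term = term.lower().strip()
        hits = []
        for omega_concept, words in OMEGA_LEXICON.items():
            if term in words:
                hits.extend(GAW_LINKS.get(omega_concept, []))
        return hits

    print(domain_concepts_for("territory"))  # -> ['dm:Area', 'dm:District', 'dm:State']

Because the lookup is deliberately loose, a misphrased or ambiguous term can still surface several plausible domain concepts for the user to inspect in the browser.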
Embedding the domain model into Omega has a second major benefit. When new databases are added to the system, new metadata concepts must be added to the domain model as well. When a new concept is a close variant of an existing domain concept, this is not a problem, but when it is rather different, or perhaps something altogether outside the current domain, the modeler requires some appropriate position for it. Omega, being a very general domain-independent taxonomy, provides a first approximation. We describe in (Hovy et al. 02) experiments in developing automated methods of linking domain concepts into Omega (or, at that time, SENSUS), using heuristics that compare concepts' names, textual definitions, and hierarchical organization across taxonomies. In (Philpot et al. 02) we describe AskCal, a natural language interface that converts the user's English question into a SIMS query, possibly via a cascaded menu interface based upon the domain model contents. Work is underway to create equivalent AskCal interfaces for Spanish and Chinese.

In this paper, we address a problem at the center of all efforts to provide access to multiple heterogeneous databases: how can one (help) automate the process of domain modeling? In particular, given a new database, how can one produce, even tentatively, new concepts for the domain and/or Omega taxonomies to express unmodeled aspects of the new data? If one can succeed in doing so, the task of the domain modeler is considerably facilitated; modeling becomes a process of checking the postulated concepts for definitional correctness, appropriate placement in the taxonomy, and correct cross-linkage.

2. Growing the Domain Model Automatically

In this section we describe our method, still under development, which involves components from its precursor GlossIT (Klavans et al. 02). The method is based on the assumption that new databases are accompanied by textual material in the form of online glossaries (Muresan et al. 03). Frequently, experts create glossaries as a means of specifying their work carefully and explicitly. Usually, glossary terms are not general-purpose English words but very specific variants of them, as used by the experts in their particular endeavor. These definitions thus tend to exhibit a rather standardized structure, something a text analysis program can exploit. In the rest of this section we describe each stage of our method.

2.1 Glossary Identification--GetGloss

The first step of our process involves identifying glossary files within a website, given a root URL. GlossIT contains the GetGloss module that performs this function. GetGloss recursively visits each page under a target URL, down to depth five. A filter inspects each page. Likely glossaries are accepted and their URLs are added to the recursive search list. The principal technical problem is creating a high-precision filter. Variations in content, style, and HTML usage must be handled (for example, non-standard markup used to delineate headwords occurs frequently). The approach taken was page categorization (into the categories glossary or not-glossary), using a set of hand-crafted rules that searched for content keywords, certain HTML constructs, and other identifying marks. To test the adequacy of these rules, we also categorized using several algorithms provided in the Rainbow text categorization toolkit (McCallum 96), as well as the Ripper system (Cohen 95). As described in Section 3.1, Rainbow's Support Vector Machine module performed extremely well. However, care must be taken to train the modules on pages with the same characteristics as the target.

2.2 Concept Formation

The task of concept formation is to identify and delimit the text describing each concept, locate within each text string specific aspects of interest, extract and convert them into the format required, if appropriate cross-
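The sketch below shows one minimal way the two steps just described, glossary identification and concept extraction, might fit together: a bounded recursive crawl from a root URL, a keyword- and markup-based filter that flags likely glossary pages, and a simple pattern that pulls headword/definition pairs out of HTML definition lists. The cue keywords, the dt/dd extraction pattern, and the example root URL are illustrative assumptions; this is not the GlossIT/GetGloss implementation.

    # Illustrative sketch of a glossary-harvesting pipeline: crawl a site to a
    # fixed depth, keep pages that look like glossaries, and extract headword /
    # definition pairs. The heuristics here are simplified stand-ins; they are
    # not the GlossIT/GetGloss rules.
    import re
    import urllib.request
    from urllib.parse import urljoin

    MAX_DEPTH = 5          # GetGloss also bounds its recursion at depth five
    KEYWORDS = ("glossary", "definitions", "terminology")  # assumed cue words

    def fetch(url: str) -> str:
        """Download a page; return empty text on any network error."""
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                return resp.read().decode("utf-8", errors="replace")
        except OSError:
            return ""

    def looks_like_glossary(html: str) -> bool:
        """Rough page filter: cue keywords plus definition-list markup."""
        text = html.lower()
        return any(k in text for k in KEYWORDS) and "<dl" in text and "<dt" in text

    def extract_entries(html: str) -> list:
        """Pull (headword, definition) pairs out of <dt>/<dd> elements."""
        def strip_tags(s):
            return " ".join(re.sub(r"<[^>]+>", " ", s).split())
        pairs = re.findall(r"<dt[^>]*>(.*?)</dt>\s*<dd[^>]*>(.*?)</dd>",
                           html, flags=re.S | re.I)
        return [(strip_tags(t), strip_tags(d)) for t, d in pairs]

    def harvest(root: str) -> dict:
        """Visit pages under root (to MAX_DEPTH) and collect glossary entries."""
        seen, found = set(), {}

        def visit(url: str, depth: int) -> None:
            if depth > MAX_DEPTH or url in seen or not url.startswith(root):
                return
            seen.add(url)
            html = fetch(url)
            if looks_like_glossary(html):
                found[url] = extract_entries(html)
            for link in re.findall(r'href="([^"#]+)"', html):
                visit(urljoin(url, link), depth + 1)

        visit(root, 0)
        return found

    if __name__ == "__main__":
        # Hypothetical root URL; substitute any agency site of interest.
        for page, entries in harvest("http://www.eia.doe.gov/glossary/").items():
            print(page, ":", len(entries), "definitions")

In practice, as noted above, a learned classifier such as Rainbow's Support Vector Machine module can replace the hand-crafted filter, and the extraction step must handle much messier markup than a clean definition list.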


Publication date: 2003